knitr::opts_chunk$set(echo = TRUE)
The problem, as the title suggests, is the detection of malaria from cell images. Malaria is a life-threatening disease caused by Plasmodium parasites. Usually the cells are identified and counted manually, which is burdensome and prone to human error. These basic tests are also quite expensive in less developed countries, where they are needed the most. If this process could be automated using image processing techniques, it would reduce costs and speed up diagnosis. There has been recent research in this regard, which I aim to explore in order to build a model that classifies cell images. Through this project, I aim to build a classification model that identifies whether a cell image is infected or not.
The dataset consists of 27,558 images of segmented cells taken from thin blood smear slide images collected during a malaria research activity. The research was conducted at Chittagong Medical College Hospital, Bangladesh. 200 participants took part in the study, of whom 50 were healthy while the rest were infected with malaria. Images were taken of different cells from each patient, each from a different angle. Along with the images, we are provided a CSV file that tells whether each image belongs to an infected patient or a healthy person.
set.seed(121)
library(EBImage) #for image processing
library(randomForest) #for generating the random forest model
library(rpart) #for building decision tree
library(rpart.plot) #for plotting decision tree
library(ggplot2) #for visualization
The data was divided into two folders, ‘Parasitized’ and ‘Uninfected’, containing equal numbers of images. Alongside, we were provided two CSV files to help us identify which images were parasitized and which were uninfected. I took the files from both folders, set up a new CSV file (code attached) holding information about both types of cells, and then put all the images into one folder, so that the data is easier to work with later. Now we load the CSV file containing information about both types of cells.
df <- read.csv(file = 'data_set.csv')
head(df)
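The consolidation step that produced `data_set.csv` can be sketched roughly as follows. This is a sketch, not the attached code: the helper name `build_data_set` is an assumption, while the folder names 'Parasitized' and 'Uninfected' and the encoding 1 = parasitized, 0 = uninfected follow the dataset description and the rest of the report.

```r
# Sketch: build one CSV with columns (image, status) from the two class
# folders and copy every image into a single 'images' folder.
# status: 1 = parasitized, 0 = uninfected.
build_data_set <- function(parasitized_dir, uninfected_dir,
                           out_dir = "images", out_csv = "data_set.csv") {
  p_files <- list.files(parasitized_dir)
  u_files <- list.files(uninfected_dir)
  df <- data.frame(image  = c(p_files, u_files),
                   status = c(rep(1, length(p_files)),
                              rep(0, length(u_files))))
  dir.create(out_dir, showWarnings = FALSE)
  file.copy(file.path(parasitized_dir, p_files), out_dir)
  file.copy(file.path(uninfected_dir, u_files), out_dir)
  write.csv(df, out_csv, row.names = FALSE)
  df
}
```
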
Now that we have all the data in one folder, and all the information in one file, we can start exploring the dataset to develop a ground level understanding of the data that we have.
labels = c("Uninfected", "Parasitized")
pie(table(df$status), labels = labels, main="Distribution of Parasitized and Uninfected cells")
We see that we have an equal representation of uninfected and parasitized cells. Now we try to explore further, and see randomly sampled images from each group.
set.seed(121) #setting seeds for the sample
p_images = df$image[df$status == 1]
sample = sample(length(p_images), 5)
sample = p_images[sample]
for (i in 1:5){
image = readImage(paste('images\\', sample[i], sep = ''))
display(image)
}
u_images = df$image[df$status == 0]
sample = sample(length(u_images), 5)
sample = u_images[sample]
for (i in 1:5){
image = readImage(paste('images\\', sample[i], sep = ''))
display(image)
}
We see that the parasitized cells have apparent patches, which the uninfected cells do not. This is important, as it will help us identify which cells are parasitized and which are uninfected. Furthermore, the patches have a similar color pattern, which could be used to further improve our predictions.
Now that we have seen sample images and observed that each parasitized cell has a patch on it, we will try to quantify that patch so that it can be included in our model for training.
We sample five random images from the dataset.
set.seed(3456)
sample_images = sample(nrow(df), 5)
status = df$status[sample_images]
head(status)
## [1] 1 0 1 0 1
sample_images = df$image[sample_images]
for (i in 1:5){
image = readImage(paste('images\\', sample_images[i], sep = ''))
display(image)
}
We plot their histograms to see if there is any visible difference between the parasitized and uninfected cells.
for (i in 1:5){
image = readImage(paste('images\\', sample_images[i], sep = ''))
hist(image)
}
We do not observe any apparent difference between the color concentrations for each of the channels. This is to be expected, since the images are large and a patch is too small to make a visible impact on the histogram. We try to address this concern a bit later. Now we use linear filters with brushes to identify different regions in our cell images; each color in the images below represents a different region.
getRegions = function(image)
{
  # Smooth each channel with a 51x51 Gaussian brush; the red channel
  # (used for thresholding below) gets a smaller sigma than green/blue.
  cellsSmooth = Image(dim = dim(image))
  sigma = c(1, 3, 3)
  for(k in seq_along(sigma))
    cellsSmooth[,,k] = filter2( image[,,k], filter = makeBrush(size = 51, shape = "gaussian", sigma = sigma[k]) )
  # Estimate the local background with a normalized disc filter and keep
  # pixels that stand out from it by more than a small offset.
  disc = makeBrush(21, "disc")
  disc = disc / sum(disc)
  offset = 0.025
  nucThresh = (cellsSmooth[,,1] - filter2(cellsSmooth[,,1], disc) > offset)
  # Label the connected components of the thresholded mask.
  bwlabel(nucThresh)
}
for (i in 1:5){
image = readImage(paste('images\\', sample_images[i], sep = ''))
nucSeed = getRegions(image)
display(colorLabels(nucSeed))
}
The linear filtering does a good job of identifying different regions and labeling them differently, but there are a few issues associated with it. It fails when the cell boundary touches the edge of the image (figure 2), and it does not identify patches that lie at the edge of the cell boundary (figure 5). Keeping these issues in mind, we continue to explore this method. Now we try to zoom in on the regions that might contain patches in the cell images.
for (i in 1:5){
image = readImage(paste('images\\', sample_images[i], sep = ''))
nucSeed = getRegions(image)
matrix = as.data.frame(table(nucSeed))
matrix = matrix[matrix$Freq > 15,]
true_indexes = vector()
if (nrow(matrix) >= 3)
{
# The first two rows of the frequency table (presumably the background
# and the cell body) are skipped; we collect pixel indexes from the
# remaining (patch) regions.
for (j in 3:nrow(matrix))
{
blob = matrix$nucSeed[j]
true_indexes = rbind(true_indexes, which(imageData(nucSeed) == blob, arr.ind = TRUE))
}
# Crop the image to the bounding box of the collected patch pixels.
zoomx = range(true_indexes[,1])
zoomy = range(true_indexes[,2])
zoomed_in = image[zoomx[1]:zoomx[2], zoomy[1]:zoomy[2],]
display(zoomed_in)
}
}
The left figure corresponds to figure 1, the central figure to figure 2, and the right figure to figure 3. Figures 4 and 5 do not appear here, as they had no region other than the cell boundary. Now we draw their histograms to look for any observable patterns.
for (i in 1:5){
image = readImage(paste('images\\', sample_images[i], sep = ''))
nucSeed = getRegions(image)
matrix = as.data.frame(table(nucSeed))
matrix = matrix[matrix$Freq > 15,]
true_indexes = vector()
if (nrow(matrix) >= 3)
{
# As before, skip the first two rows (presumably background and cell
# body) and collect pixel indexes from the remaining regions.
for (j in 3:nrow(matrix))
{
blob = matrix$nucSeed[j]
true_indexes = rbind(true_indexes, which(imageData(nucSeed) == blob, arr.ind = TRUE))
}
zoomx = range(true_indexes[,1])
zoomy = range(true_indexes[,2])
zoomed_in = image[zoomx[1]:zoomx[2], zoomy[1]:zoomy[2],]
hist(zoomed_in)
}
}
Again, we do not observe any patterns in the color combinations, even when the images are zoomed in. Thus, utilizing this method, we have identified three issues: (1) the histograms show no usable pattern even after zooming in, (2) the segmentation fails when the cell boundary touches the edge of the image, and (3) patches lying at the edge of the cell boundary are missed.
To solve problem 2, we can pad each image by 20 pixels in each dimension, like this:
for (i in 1:5){
image = readImage(paste('images\\', sample_images[i], sep = ''))
# Embed the image in a black (zero) array that is 20 pixels larger in each
# spatial dimension, giving a 10-pixel border on every side.
someData <- rep(0, (20+dim(image)[1])*(20+dim(image)[2])*dim(image)[3])
new_image = array(someData, c((20+dim(image)[1]), (20+dim(image)[2]), dim(image)[3]))
new_image[11:(dim(image)[1] + 10), 11:(dim(image)[2] + 10),] = image
image = Image(new_image, colormode = 'color')
nucSeed = getRegions(image)
display(colorLabels(nucSeed))
}
We have solved problem 2, but we do not yet have solutions for the other two problems. Thus we try to approach the problem slightly differently, using what we already know about it.
Now we try to utilize the channels of the image to identify whether different regions exist, i.e. extracting one channel at a time (R, G, or B) and converting it to grayscale to see whether the patterns become visible.
image = readImage(paste('images\\', sample_images[5], sep = ''))
display(image)
display(Image(image[,,1], color = 'grayscale'))
display(Image(image[,,2], color = 'grayscale'))
display(Image(image[,,3], color = 'grayscale'))
We see that the green channel gives a very good idea of the patch. The advantage of converting to grayscale is that we can get rid of the background, and everything unrelated to the patch, quite easily.
for (channel in 1:3)
{
channel_img = image[,,channel]
# Map the black background (exactly 0) and the bright cell interior
# (above 0.25) to light gray, so that only the dark patch pixels remain.
channel_img[channel_img == 0] = 0.9
channel_img[channel_img > 0.25] = 0.9
channel_img = Image(channel_img, color = 'grayscale')
display(channel_img)
}
Just using basic thresholding, we have been able to get rid of the background and the cell boundary, which makes it very easy for us to identify the patch. We go ahead and apply this method to the whole dataset. For each image, we write the number of black pixels in each channel after thresholding to a CSV file, so that we do not have to process the images again and again. We have written out this data for several different threshold values.
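The per-image feature extraction described above can be sketched roughly as follows. The helper name `count_dark_pixels` is an assumption, as is the exact definition of a "dark" pixel (non-zero and at or below the threshold), but the counts correspond to the `r`, `g`, `b` columns loaded below.

```r
# Count, per channel, how many pixels survive the thresholding step,
# i.e. are neither background (exactly 0) nor bright cell interior
# (above the threshold). These counts become the r, g, b features.
count_dark_pixels <- function(img, threshold) {
  sapply(1:dim(img)[3], function(channel) {
    ch <- img[, , channel]
    sum(ch > 0 & ch <= threshold)
  })
}

# Example on a synthetic 4x4x3 "image": a black background, a bright
# cell interior, and two dark patch pixels in the green channel.
img <- array(0, c(4, 4, 3))
img[2:3, 2:3, ] <- 0.8   # bright cell interior in all channels
img[2, 2, 2] <- 0.1      # dark patch pixel (green channel)
img[3, 3, 2] <- 0.15     # another dark patch pixel (green channel)
counts <- count_dark_pixels(img, threshold = 0.25)  # c(0, 2, 0)
```
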
threshold_02 = read.csv(file = "data_with_threshold_02.csv")
threshold_02$status = as.factor(threshold_02$status)
threshold_03 = read.csv(file = "data_with_threshold_03.csv")
threshold_03$status = as.factor(threshold_03$status)
threshold_04 = read.csv(file = "data_with_threshold_04.csv")
threshold_04$status = as.factor(threshold_04$status)
threshold_025 = read.csv(file = "data_with_threshold_025.csv")
threshold_025$status = as.factor(threshold_025$status)
threshold_035 = read.csv(file = "data_with_threshold_035.csv")
threshold_035$status = as.factor(threshold_035$status)
train = sample(dim(threshold_02)[1], (dim(threshold_02)[1]*0.7))
threshold_02.train = threshold_02[train,]
threshold_02.test = threshold_02[-train,]
threshold_03.train = threshold_03[train,]
threshold_03.test = threshold_03[-train,]
threshold_04.train = threshold_04[train,]
threshold_04.test = threshold_04[-train,]
threshold_025.train = threshold_025[train,]
threshold_025.test = threshold_025[-train,]
threshold_035.train = threshold_035[train,]
threshold_035.test = threshold_035[-train,]
We have loaded the datasets, each built with a different threshold value. Now we go on to train our models. We plan to use decision trees and random forests (with mtry equal to the number of predictors, i.e. bagged trees) as our models.
rf_02 = randomForest(status ~ r + g + b , data = threshold_02.train, mtry = 3)
rf_04 = randomForest(status ~ r + g + b , data = threshold_04.train, mtry = 3)
rf_25 = randomForest(status ~ r + g + b , data = threshold_025.train, mtry = 3)
rf_35 = randomForest(status ~ r + g + b , data = threshold_035.train, mtry = 3)
dt_02 = rpart(status ~ r + g + b , data = threshold_02.train, method = 'class')
dt_04 = rpart(status ~ r + g + b , data = threshold_04.train, method = 'class')
dt_025 = rpart(status ~ r + g + b , data = threshold_025.train, method = 'class')
dt_035 = rpart(status ~ r + g + b , data = threshold_035.train, method = 'class')
Now that we have different models, each with a different threshold, we try each of them on our test data. We write a function that helps calculate the misclassification rate from a confusion matrix.
getError = function(table){
return((table[1,2] + table[2,1])/(table[1,1] + table[1,2] + table[2,1] + table[2,2]))
}
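As a quick sanity check, here is `getError` (reproduced from above) applied to a hypothetical confusion matrix with 90 and 95 correct predictions and 5 and 10 misclassifications:

```r
# getError as defined above: the off-diagonal share of a 2x2
# confusion matrix (rows = predicted, columns = actual).
getError = function(table){
  return((table[1,2] + table[2,1])/(table[1,1] + table[1,2] + table[2,1] + table[2,2]))
}

# Hypothetical confusion matrix, filled column by column.
cm <- matrix(c(90, 5, 10, 95), nrow = 2)
err <- getError(cm)  # (10 + 5) / 200 = 0.075
```
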
thresholds = c(0.2, 0.4, 0.25, 0.35)
errors_rf = vector()
rf_02_test = predict(rf_02, threshold_02.test)
errors_rf = rbind(errors_rf, getError(table(rf_02_test, threshold_02.test$status)))
rf_04_test = predict(rf_04, threshold_04.test)
errors_rf = rbind(errors_rf, getError(table(rf_04_test, threshold_04.test$status)))
rf_025_test = predict(rf_25, threshold_025.test)
errors_rf = rbind(errors_rf, getError(table(rf_025_test, threshold_025.test$status)))
rf_035_test = predict(rf_35, threshold_035.test)
errors_rf = rbind(errors_rf, getError(table(rf_035_test, threshold_035.test$status)))
errors_dt = vector()
dt_02_test = predict(dt_02, threshold_02.test)
dt_02_test = ifelse(dt_02_test[,1] > 0.5, 0, 1)
errors_dt = rbind(errors_dt, getError(table(dt_02_test, threshold_02.test$status)))
dt_04_test = predict(dt_04, threshold_04.test)
dt_04_test = ifelse(dt_04_test[,1] > 0.5, 0, 1)
errors_dt = rbind(errors_dt, getError(table(dt_04_test, threshold_04.test$status)))
dt_025_test = predict(dt_025, threshold_025.test)
dt_025_test = ifelse(dt_025_test[,1] > 0.5, 0, 1)
errors_dt = rbind(errors_dt, getError(table(dt_025_test, threshold_025.test$status)))
dt_035_test = predict(dt_035, threshold_035.test)
dt_035_test = ifelse(dt_035_test[,1] > 0.5, 0, 1)
errors_dt = rbind(errors_dt, getError(table(dt_035_test, threshold_035.test$status)))
Now we plot the misclassification rate against the threshold for random forests and decision trees, respectively.
plot(thresholds, errors_rf, xlab = "Threshold", ylab = "Misclassification rate", main = "Random forest")
plot(thresholds, errors_dt, xlab = "Threshold", ylab = "Misclassification rate", main = "Decision tree")
range(errors_rf)[1]
## [1] 0.05454765
range(errors_dt)[1]
## [1] 0.05224964
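To read off which threshold attains that minimum, we can index the threshold vector with `which.min`. The error values below are illustrative placeholders, not the actual test results computed above:

```r
# Illustrative placeholder values; in the report, errors_dt comes from
# the evaluation loop above.
thresholds <- c(0.2, 0.4, 0.25, 0.35)
errors_dt  <- c(0.0601, 0.0522, 0.0588, 0.0575)  # hypothetical
best <- thresholds[which.min(errors_dt)]  # 0.4 for these values
```
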
Thus, the decision tree with a threshold of 0.4 produces the best result out of all the models we tested, so we consider it the best model for this problem among the candidates.
summary(dt_04)
## Call:
## rpart(formula = status ~ r + g + b, data = threshold_04.train,
## method = "class")
## n= 19289
##
## CP nsplit rel error xerror xstd
## 1 0.8940274 0 1.0000000 1.0109913 0.007200152
## 2 0.0100000 1 0.1059726 0.1062837 0.003230337
##
## Variable importance
## g b r
## 87 10 4
##
## Node number 1: 19289 observations, complexity param=0.8940274
## predicted class=1 expected loss=0.4999741 P(node) =1
## class counts: 9644 9645
## probabilities: 0.500 0.500
## left son=2 (9698 obs) right son=3 (9591 obs)
## Primary splits:
## g < 4.5 to the left, improve=7709.0410, (0 missing)
## b < 0.5 to the left, improve= 435.1310, (0 missing)
## r < 0.5 to the left, improve= 145.2142, (0 missing)
## Surrogate splits:
## b < 0.5 to the left, agree=0.557, adj=0.109, (0 split)
## r < 0.5 to the left, agree=0.523, adj=0.041, (0 split)
##
## Node number 2: 9698 observations
## predicted class=0 expected loss=0.05547536 P(node) =0.5027736
## class counts: 9160 538
## probabilities: 0.945 0.055
##
## Node number 3: 9591 observations
## predicted class=1 expected loss=0.05046398 P(node) =0.4972264
## class counts: 484 9107
## probabilities: 0.050 0.950
As expected, the green channel holds the most importance, and our tree looks like this:
rpart.plot(dt_04)
Thus, we have been able to create a model with an accuracy of roughly 95% in classifying whether a cell is infected or not. We explored different methods, but in the end, splitting the images into their color channels and exploiting the color properties of the patch helped us identify whether a cell is infected. We used both random forests and decision trees; the decision tree with a threshold of 0.4 produced the marginally better result.